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Abstract 

In this paper , efficient dynamic simulation algorithms for a system of m manipulators, 
cooperating to manipulate a large load , are developed; and their performance , using two 
possible forms of parallelism on a general-purpose parallel computer , is investigated. One- 
form. t temporal parallelism , is obtained with the use of parallel numerical integration methods 
[1]. A speedup of 3.78 on four processors of a CRAY Y-MP8 was achieved with a parallel 
four-point block predictor- corrector method for the simulation of a four manipulator system. 
These multi-point methods suffer from reduced accuracy, and when comparing these runs 
with a serial integration method , the speedup can be as low as 1.83 for simulations with the 
same accuracy. To regain the performance lost due to accuracy problems , a second form of 
parallelism is employed. Spatial parallelism allows most of the dynamics of each manipulator 
chain to be computed simultaneously. Used exclusively in the four processor case, this form 
of parallelism in conjunction with a serial integration method results in a speedup of 3. 1 on 
four processors over the best serial method. In cases where there are either more processors 
available or fewer chains in the system, the multi-point parallel integration methods are 
still advantageous despite the reduced accuracy because both forms of parallelism can then 
combine to generate more parallel tasks and achieve greater effective speedups. This paper 
also includes results for these cases. 

1. Introduction 

With increases in the structural complexities of today’s systems, such as multiple manip- 
ulators, multilegged vehicles, and flexible robots, parallelization of the dynamic algorithms 
for these systems must be considered in an effort to improve computational rates. With 
significant speedups over previous implementations, real-time performance of graphic ani- 
mation would make man-in-the-loop remote control of these systems feasible [2]. And with 
super-real-time simulation (computing seconds of motion in milliseconds), an entirely new 
approach to on-line robotic control using predictive simulation for planning is within range 
[3]. One promising area of research which is striving to achieve these computational rates 
focuses on the use of parallel algorithms. 

In previous work, fine-grain parallel algorithms have been developed for robot dynamics 
computations. Some require the use of special-purpose architectures to implement the fine- 
grain parallelism of the computations required for a single-chain system [4, 5, 6, 7]. Others 
decompose the algorithm into groups of concurrent tasks that are scheduled on a number of 
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tightly coupled parallel processors to produce speedup [8, 9], All of these algorithms, while 
designed to perform well on specific parallel architectures, are usually much less efficient 
when implemented on available general-purpose parallel machines with a limited number of 
processors. These systems were not designed to handle the large amount of interprocessor 
communication and synchronization that is required by the fine-grain tasks. 

In our work, parallel algorithms which run efficiently on existing general-purpose parallel 
computers are investigated. In the initial stage of this research, the dynamic algorithms 
required to simulate a single, open-chain robotic manipulator were effectively parallelized 
[l]. The approach used a parallel block predictor-corrector integration method to achieve 
a temporal parallelism which enabled the forward dynamics problem to be computed for 
multiple points in time simultaneously. In this paper, the work is extended to simulate 
systems of multiple manipulators that are cooperating to manipulate a large load. In this 
case, a second form of parallelism which corresponds to the system’s structural parallelism 
is explored. Called spatial parallelism, it allows most of the dynamics of the individual 
chains to be computed in parallel as well. 

This work shows that there are various costs associated with the two forms of paral- 
lelism which affect their efficiency. With temporal parallelism, the accuracy of the parallel 
integration methods is lower than the corresponding serial methods. As a result, smaller 
stepsizes must be used, and hence, more computation must be performed with the tempo 
ral parallel methods to achieve the same accuracy. With spatial parallelism, most of the 
dynamics may be computed in parallel; however, a serial portion of the algorithm which 
computes the dynamics of the common load reduces the overall efficiency of this parallelism 
as well. The advantage of these methods is that they may be combined by implementing the 
spatial parallelism within each parallel integration task to gain even greater speedups. Our 
results show the effects of reduced efficiency of both methods and how they are combined 
to gain the greatest speedups on varying numbers of processors. 

In the following section, the algorithm to simulate the multiple manipulator system 
is developed. In this development, the dynamics used in the previous work for a single 
manipulator are extended and the spatial parallelism in this algorithm is presented. In 
Section 3, the block predictor-corrector integration methods are presented which provide 
various amounts of temporal parallelism in the simulation problem. Section 4 discusses 
how both forms of parallelism are combined and implemented on a general-purpose parallel 
computer. This algorithm is then implemented on the CRAY Y-MP8/864, and the speedup 
results are given in Section 5 for various configurations of the simulation system including 
spatial parallelism only, temporal parallelism only, and a combination on various numbers 
of the CRAY’s processors. Finally, a summary and conclusions are presented in Section (). 


2. Parallel Algorithm for Multiple Manipulator Dynamic Equations 

In this section, an efficient dynamic simulation algorithm for a multiple, closed-chain 
manipulator system is developed, and the spatial parallelism in the algorithm is investi- 
gated. The system contains m manipulators, each with N degrees of freedom, that are 
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rigidly grasping a common load, called the reference member. Each simple, closed chain 
(manipulator) is governed by the dynamic equations of motion for a single chain . Numbering 
the chains in the system 1 through m, the equation for each chain is given as follows: 

Tk = H*;(q*;)qfc + Cfc(qjfc, qA:) + Gjt(qfc) + Jfc(qfc)ffc for all k = 1, . . .,m, (1) 

where 


Tk 

q k, q*, qit 

H* 

c k 

Gk 

f* 


N X 1 vector of torques/forces of the joint actuators, 

N x 1 vectors of joint positions, rates, and accelerations, 

N X N symmetric, positive definite inertia matrix, 

N x 1 vector specifying centrifugal and coriolis effects, 

N x 1 vector specifying the effects due to gravity, 

6 X N Jacobian matrix, and 

6x1 force exerted by the tip of chain k on the reference member. 


The “for all” in Eq. (1) indicates that the equation may be computed for all chains in parallel 
provided the required quantities are known. Since we are interested in the solution to the 
Forward Dynamics (or simulation) problem, the state of the system, consisting of joint 
positions and rates, and the input joint torques/forces are given. The joint accelerations for 
each chain must then be computed. Lilly and Orin [10] present an efficient serial algorithm 
for Forward Dynamics which is the basis for the parallel implementation used in this paper. 

By grouping some terms from Eq. (1) and solving for the joint accelerations, we obtain 
the following equation: 

qfc = qt op - - for all k — 1 , . . . , m, (2) 

where qjt op( . n is the vector of joint accelerations for chain k if it were not contacting the 
reference member causing the tip force to be zero. This is called its open-chain solution 
which was implemented in [1]. Note that this quantity, as well as H j. * Jj, depends only on the 
system state and input joint torques/forces and may be computed from known quantities. 

An equivalent operational-space formulation for the dynamics of a closed-chain manip- 


ulator can be found by using the following relationship: 

x/t = ( 3 ) 

where is the Cartesian velocity of the tip of the manipulator. Taking the time derivative 
of both sides yields the equation for the closed-chain acceleration of the tip: 

xjt = Jfcq*. + Jjtq*7 0) 

and substituting from Eq. (2) yields the following: 

X* : = y.k opcn - for all k = l,...,m, (5) 

where 

^fcop«n = 3 k*Lk + J^qfc open? and (G) 
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(7) 


A it = (J.H-Jl) 

The quantity, Xfc ope „, is the acceleration of the tip of chain k if it were open, and A* is the 
operational space inertia matrix [11]. Both quantities may be computed given the current 
state of the system and the input. The physical interpretation of A* is an inertial quantity 
which combines the dynamic properties of the chain and projects them to its tip. It is 
the physical resistance that is “felt” when a force is exerted on the tip, and it defines the 
relationship between the force that is applied to the tip, and the resulting acceleration of 
the tip of the chain, x The advantage to using these operational space quantities is that 
the matrix sizes are constant regardless of the number of degrees of freedom in the chain. 

With the chains attached with a fixed grip to the reference member, the accelerations 
of the tips of each chain, at an appropriate point attached to the end-effector, are the same 
and are equal to the reference member acceleration, a r . As a result, Eq. (5) can be rewritten 
as follows: 

a r = Xk BP en ~ A fc l f* for all k = l,...,m. (8) 

Now the dynamic behavior of the reference member can be determined using a spatial 
force balance equation. This states that the sum of the spatial forces exerted by the tips of 
the chains and any other external forces (including gravity) is related to the acceleration of 
the reference member through its inertia. This may be written as follows: 

m 

^fA.. + g r = I r a r + b r , (9) 

A r=l 


where 

g r 6x1 vector specifying the force of gravity exerted on the reference member, 
a r 6x1 acceleration vector of the reference member, 

I r 6x6 spatial inertia of the reference member [10], 


and the last term, b r , is a spatial vector specifying the bias force due to the spatial velocity 
of the reference member. 


In this work, we also assume that the chains are not in singular configurations, and 
consequently, all of the A^ 1 matrices are non-singular. Therefore, the unknown force terms, 
f may be isolated in Eq. (8) and substituted into Eq. (9). After collecting terms, the 
following equation results: 


m 

1 r + A*: 


k= 1 


a r 


m 

g r - b r + A iXfc 

open I * 

fc=i 


( 10 ) 


This equation states that the sum of all the inertias of the system multiplied by the reference 
member acceleration is equal to the sum of all the forces that are exerted on the reference 
member. The acceleration term, a r , may now be determined from this linear system of 
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chain k-1 


chain k 


chain k+1 



equations using any linear system solver (in this case Cholesky decomposition can be used). 
Finally a r is substituted back into Eq. (8) to determine the tip forces, f*, on each chain and 
this can then be used to determine the joint accelerations from Eq. (2). 

The flowchart in Figure 1, outlines the steps in the algorithm to compute the desired ac- 
celerations. This constitutes the “derivative computation” which uses the state information 
that is provided by a numerical integration routine. The additional computation required 
for the closed chain dynamics that were not present in the open-chain algorithm used in 
[1] is indicated with the appropriate equation numbers except for the computation of the 
manipulator Jacobian. With slight modification to the inertia matrix routine that was used 
in [1], this matrix may be computed as well. The algorithm for this computation can also 
be found in [12]. Note that the Open-Chain Dynamics (OCD) and Closed-Chain Dynam- 
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Figure 2: Spatial Parallelism in Multiple-Chain Dynamic Equations (Serial RMA Compu- 
tation). 

ics (CCD) blocks are repeated for each chain in the system, while the Reference Member 
Acceleration (RMA) is performed once for a given derivative evaluation. This implies that 
the OCD and CCD computations for each chain may be executed in parallel. 

Figure 2 shows how this algorithm, along with a numerical integration method, would 
be implemented on a general-purpose parallel computer. The CS (Compute State) block 
consists of the integration method that is used to compute the state of the required chain, 
and will be discussed in the next section. A processor synchronization is required after the 
parallel OCD computation in order to collect the chains’ operational space inertia matrices 
and open-chain accelerations before the serial RMA computation is performed. In Figure 2 
this is shown as a barrier which holds the parallel tasks which have completed the OCD 
section until all m of them have finished. After the serial computation is completed by one 
processor, it signals the other processors to continue with the parallel CCD computations. 
This is accomplished by posting an event, signal for which the other processors are waiting. 

This computation can be modified to remove the event synchronization as shown 
in Figure 3. This is accomplished by allowing each task to maintain its own “copy” of the 
reference member information and perform the same computation on all of them. While 
this introduces redundant computation, it can actually reduce the wallclock time because a 
costly synchronization has been removed and the same time that was spent waiting for one 
processor to perform the RMA calculation is now used to perform it on all of the processors. 
This algorithm is then used within the framework of various parallel integration methods 
that are described in the next section. 
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Figure 3: Spatial Parallelism in Multiple-Chain Dynamic Equations (Redundant Parallel 
RMA Computation). 


3. Parallel Integration Methods 

Many different algorithms exist to perform numerical integration. One set of algorithms 
which has shown in the past to be readily parallelizable are predictor-corrector methods. 
The standard serial algorithm, commonly abbreviated PECE, consists of two pairs of steps 
which correspond to the letters of the abbreviation. The steps consist of Predicting the next 
state and Evaluating the derivative based on this prediction, and then Correcting the state 
with a corresponding derivative Evaluation. The derivative evaluations required in both 
steps to determine the accelerations were presented in the last section, and the methods for 
predicting and correcting the states are described here. 

Tn this paper, fifth-order methods are used because they provide an adequate tradeoff 
between accuracy, which tends to increase with order, and stability, which tends to decrease 
with order. Figure 4 shows the quantities needed and computed in both steps of the serial 
method. This method, usually called PECE5, will be referred to as B1PC5 in the rest of 
this paper to indicate a one-point block method which utilizes fifth-order predictor-corrector 
formulas. The predictor step, shown in Figure 4(a), uses a linear combination of five past 
derivative values, f— 4 , . . *,fo (hence it is fifth-order) and the state at the most recent point 
in time, y 0 to predict the state of the system (joint and reference member positions and 
rates) at the next point in time, yf. With this, the algorithm in the previous section can 
be used to compute the derivative (joint and reference member accelerations) at this point, 
fj\ which is based on the predicted state. To correct the state in the second step of the 
method, the quantities shown in Figure 4(b) are used. In this step, the “oldest” derivative 
value is dropped and the linear combination used to compute the corrected state, y), now 
includes the new predicted derivative value. The same state quantity, y 0 , is used rather 
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Figure 5: Four-Point Parallel B4PC5 Integration: (a) predictor step, (b) corrector step. 



than the predicted one for reasons of stability and accuracy of the method. Finally, the 
corrected derivative value is computed using y^. Then in the next iteration of the method, 
the values are shifted and four old derivative values and the new corrected derivative and 
state values are used. 

A parallel version of this method called the block predictor-corrector method was first 
formulated by Birta and Abou-Rabia [13] and used in our previous work [1]. The fifth- 
order version of this method, B4PC5, is shown in Figure 5. In this method, a block of 
four points are computed in parallel during a single iteration. Each processor is responsible 
for computing the required quantities at a single point in the block. In the predictor step 
shown in Figure 5(a), each processor computes the predicted value for its point using the 
same information (but with a different linear combination) in parallel. Then each uses 
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Figure 6: Two-Point Parallel B2PC5 Integration: (a) predictor step, (b) corrector step. 

the dynamics algorithm to compute the derivative at its point in parallel as well. In the 
corrector step, an entire block of old derivative values are discarded in favor of the predicted 
ones as shown in Figure 5(b). Once the processors have swapped the derivative information 
from the predictor step, correction and subsequent derivative evaluation can again be done 
in parallel. 

The B4PC5 method offers a way to simulate the system using four processors in parallel. 
Because the method extrapolates farther from the known values in the predictor step, it 
is subject to larger errors for a given stepsize. Also, when both forms of parallelism are 
combined in the next section, the number of parallel tasks may exceed the numbei of 
processors available. For these reasons, we developed a two-point parallel method called 
B‘2PC5 which lies between the two methods described thus far in terms of parallelism and 
accuracy. This method is shown in Figure 6. Instead of predicting four points as in the 
B4PC5, it only predicts two new points and thus can be implemented in parallel on two 
processors. The coefficients for both steps of this method were computed using the method 
of undetermined coefficients so that the resulting method is fifth-order as well. 

The structure of the temporal parallelism for a single block of the parallel integration 
methods presented in this section is shown in Figure 7. The PE (compute Predicted state 
and Evaluate the derivative) and CE (compute Corrected state and Evaluate the derivative) 
blocks in this figure make up the predictor and corrector steps in these methods along with 
the derivative evaluations described in the previous section. There are p parallel tasks which 
correspond to the number of block points computed by the method during a single iteration. 
A single column of blocks represents the computation of the solution of the entire m-chain 
system for a specific point in time on one of p processors. Barriers are used to synchronize 
these parallel tasks after each derivative evaluation so that state and derivative information 
may be exchanged. An event is also used so that a small amount of serial processing may 
be performed by a single processor before starting the next integration iteration. 
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Figure 7: Temporal Parallelism in Dynamic Simulation. 

4. Implementation of Combined Parallelism 

Both forms of parallelism, spatial and temporal, can be combined by introducing the 
spatial parallelization of the derivative evaluations in Figure 3 into the individual PE/CE 
computational blocks of Figure 7. In this case, the CS (Compute State) blocks of Figure 3 
become the predictor or corrector equations of the desired integration method. A modi- 
fication is then made to decrease the required synchronization between all of the tasks to 
improve the performance. This modification is made to the barriers and events associated 
with the temporal parallelism. Because the integration of the state of each chain depends 
only on its own values at all of the block points, the temporal barriers are broken apart 
so that each one only synchronizes with the processors associated with the same chain. 
In some sense, these temporal synchronization points have been spatially parallelized. In 
addition, the serial Shift Blocks task can also be spatially parallelized. 

Figure 8 shows the resulting implementation which consists of as many as mp paral- 
lel tasks. Also included in this figure are the synchronization commands that are used in 
the implementation on the CRAY Y-MP8/864. The simulation code was written in FOR- 
TRAN (Version 5.0 running under UNICOS Release 6.0) using macrotasking commands to 
implement the parallelism. The BARSYNC commands correspond to the synchronization 
barriers and the argument supplied to this command corresponds to the number of parallel 
tasks that must be synchronized by it. Finally, the EVPOST-EVWAIT commands pro- 
vide the mechanism by which certain processors are suspended while others perform serial 
operations. 

In order to examine the effects of various configurations on the parallelism (taking the 
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Figure 8: Structure of Combined Spatial and Temporal Parallelism. 

number of integration block points, number of chains, and number of physical processors 
into account), a mechanism was included in the implementation which allows the user to 
specify the number of desired parallel tasks along both the temporal and spatial dimensions 
of the simulation regardless of the number of chains and block points. These numbers are 
denoted by m and p, respectively, and are used in place of m and p in Figure 8 when 
referring to the number of actual parallel computational blocks. With this mechanism, 
for example, the required computation could be paired off and a single processor could be 
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Table 1: Speedup Results for B4PC5. 


# Processors 
Requested 

Chain 
Tasks, m 

Integration 
Tasks, p 

T 

(sec.) 

s p 

i 

i 

i 

0.241 

1.00 

4 

i 

4 

0.0637 

3.78 


2 

2 

0.0668 

3.61 


4 

1 

0.0731 

3.30 

8 

2 

4 

0.0390 

6.18 


4 

2 

0.0417 

5.78 


responsible for the computations of two chains (or two block points, or both) instead of 
just one. This would allow the computation of a larger system, in which mp exceeded the 
number of processors, to be efficiently mapped to smaller parallel machines. In order to 
maintain the load balance, however, the specified partitions in both dimensions should be 
integer divisors of the total number of block points and chains, respectively. Results for 
various partitions of the computation using this method are given in the next section. 


5. Results 

With the parallel algorithm described in the last section, a system consisting of four 
PUMA 560 manipulators was simulated. A test trajectory with a duration of one second 
was generated which consisted of lifting a 4kg object 0.8 meters straight up. Then an 
appropriate joint torque profile was computed which, when applied to the joints of the 
manipulators, would produce the desired motion. These torques were used as input into 
the simulator, and the error in the final position of the reference member was used as a 
measure of accuracy for the various integration methods. Only this value was reported for 
brevity and also its accuracy was representative of the accuracy of the rest of the states in 
the system. 

The simulation was then executed using the B4PC5 integration method on 1,4, and 
8 of the Y-MP’s eight processors. Using the mechanism for partitioning the computation 
among existing processors, the block point and chain computations were partitioned singly 
(one chain and one block point per processor), in pairs (two chains and two blocks points), 
all together, or any useful combination thereof. A fixed integration stepsize was used and 
varied over a number of runs so that a profile of execution time versus error was produced. 
To get a fair comparison of relative performance between the different integration methods 
in later experiments, an estimate of the execution time, T, required to achieve an error of 
10 -6 meters was reported in Table 1. 

Examining these execution times for a given number of processors, it can be seen that 
they increased as the number of temporal parallel tasks, p, decreased. When p is less 
than four, the full amount of temporal parallelism provided by this integration method 
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Table 2: Accuracy Performance for Various Configurations. 


# Processors 
Requested 

Chain 
Tasks, 77i 

Integration 
Tasks, p 

Block 

Size 

T 

(sec.) 

S p 


St 

1 

i 

i 

4 

0.241 

1.00 

0.485 

0.485 




2 

0.152 

1.00 

0.770 

0.770 




1 

0.117 

1.00 

1.00 

1.00 

4 

i 

4 

4 

0.0637 

3.78 

0.485 

1.83 


2 

2 

4 

0.0668 

3.61 

0.485 

1.75 




2 

0.0444 

3.42 

0.770 

2.63 


4 

1 

4 

0.0731 

3.30 

0.485 

1.60 




2 

0.0463 

3.28 

0.770 

2.53 




1 

0.0377 

3.10 

1.00 

3.10 

8 

2 

4 

4 

0.0390 

6.18 

0.485 

3.00 


4 

2 

4 

0.0417 

5.78 

0.485 

2.80 




2 

0.0295 

5.15 

0.770 

3.97 


is not being fully utilized. As a result, increased numbers of redundant reference member 
acceleration (RMA) calculations are being performed by each processor in an efTort to avoid 
the extra synchronization point that was discussed at the end of Section 2. Consequently, 
the best speedups due to parallelization, 5 P , were achieved by using the greatest amount of 
temporal parallelism which was as high as 3.78 on four processors. Runs were also made 
while requesting all eight of the Y-MP’s processors and a speedup of 6.18 was still achieved 
despite an inability to gain dedicated use of these processors in our multi-user environment. 

When p was less than four, simulations using an integration method with a smaller block 
size, such as B2PC5 or serial B1PC5, were also tried. Because these methods compute 
fewer points per iteration, the cost that is incurred is an increased number of iterations 
(and hence, parallel task synchronizations) than the B4PC5 method for a given integration 
stepsize. However, the overall efficiency of these methods increased because they were more 
accurate. This resulted in fewer iterations and less execution time for a given amount of 
error. The complete set of results using the various integration block sizes as well as parallel 
configurations is shown in Table 2. 

In this table, S p is the speedup for a given method over the serial runtime using the 
same integration method. Since there is an accuracy loss associated with the larger block 
methods a speedup due to algorithmic changes, S a , is also reported. This corresponds to the 
amount of time relative to the B1PC5 method that is required by the methods to compute 
the trajectory with a given error. Based on the serial results for the three integration 
methods, the traditional serial method, B1PC5, took less than half the time to simulate 
the trajectory to the desired accuracy as the B4PC5, and the performance of B2PC5 fell in 
between. The last column shows the total effective speedup of the various configurations. 
This is based on the time required for the method to simulate the trajectory to the desired 
accuracy as compared to the best serial time which was exhibited by the B1PC5 method, 
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and it is equal to the product of S p and S a . When compared with the best integration 
methods the S p speedups obtained with the B4PC5 method must therefore be cut in about 
one-half, and the B2PC5 speedups are reduced by approximately 23%. 

Therefore, the best effective speedups were obtained using the integration method with 
the smallest block size while partitioning the computation so that the number of parallel 
tasks was equal to the number of processors. Since the system had four chains, the serial 
B1PC5 method was the best on four processors and the resulting speedup was 3.1. On 
eight processors, where the serial method could not be used and still have eight parallel 
tasks, the B2PC5 method showed the greatest speedup at 3.97. From these results it would 
appear that the B4PC5 has no advantage. However, if a system with a smaller number of 
chains is to be simulated, this method would allow a greater number of parallel tasks to be 
generated than the other methods and would most likely exhibit the greatest speedup. 

6. Summary and Conclusions 

In this paper, two approaches for achieving effective parallelization for dynamic sim- 
ulation on a general-purpose parallel computer were presented. One approach that was 
discussed in [1] for parallel simulation of a single chain system was based on temporal par- 
allelism achieved with the use of a parallel numerical integration method. In this paper, the 
work has been extended to include multiple chain systems which introduce a second form 
of parallelism. Called spatial parallelism, the form comes from the ability to compute the 
dynamics of individual chains in the system simultaneously. 

Various ways to use both forms of parallelism to the greatest advantage were investi- 
gated. The greatest effective speedup from these methods was gained by partitioning the 
computation into as many load balanced parallel tasks as possible while using the integra- 
tion method with the smallest block size. This implies that the greatest amount of spatial 
parallelism, and the most accurate integration methods should be employed. 

With the general rule in mind, our results for the simulation of a four chain system 
showed that the greatest speedup on four processors of the CRAY Y-MP8 was 3.1. This 
was achieved with spatial parallelism only, and the use of the serial predictor-corrector 
integration method. Even greater speedup was achieved on eight processors when full 
spatial parallelism was used. In this case, a two-point parallel integration method was 
used to achieve the desired amount of parallel tasks. And it appears that the four-point 
integration method would be beneficial if sixteen processors are available. 

An additional benefit to these forms of parallelism is that they do not preclude any of 
the previous work mentioned in the introduction that dealt with the fine-grain parallel algo- 
rithms for computation of the robot dynamics quantities. The special-purpose architectures 
required could be set up in parallel and used in conjunction with the methods discussed 
in this paper. The resulting combination of parallel computation could be thought of as 
occurring in three dimensions, and shows promise for even greater speedups. 
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Aerospace interest in stabilization and control of s/c attitude, using gravity-gradient and 
momentum exchange devices fostered the development of equations of motion for multibody 
systems. 
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Docking Cryogenic Coolers 
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RECENT ADVANCES 
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QUASI-STATIC • NO RAPID-TRANSIENT HEATING 

NEEDED FOR ENVIRONMENTAL, 
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MODELING 
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